Schibsted YAMS

How to build and maintain a service at thousands of requests per second with minimal dedication

daniel(dot)caballero at schibsted(dot)com

Who are you?



Daniel Caballero

DevOps/SRE Engineer @ Schibsted

Part-time (DevOps) lecturer @ La Salle University

So... I work

... I (sort of) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

And I really don't like to waste it

  • In resolving incidents
  • In repetitive work

Schibsgrñvahed..WHAT??

What is Schibsted?

And SPT?

It's about convergence through global solutions

What's behind global components / services?

You build it, you run it

Probably nothing new on the horizon for you

That means there's no ops/support/systems/devops team.

{
    "format": "webp",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}
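A minimal sketch of how a rule like the one above could drive a transformation. This is plain Python with hypothetical helper names, not the actual YAMS implementation: it parses the rule and computes the output size for a `clip` fit, which scales the image down to fit the requested width while preserving the aspect ratio (and never upscales).

```python
import json

# Simplified version of the rule document above (watermark omitted).
RULE = json.loads("""
{
    "format": "webp",
    "actions": [{"resize": {"width": 300, "fit": {"type": "clip"}}}],
    "quality": 90
}
""")

def clip_resize(src_w, src_h, target_w):
    """'clip' fit: scale down to the target width, keep the aspect
    ratio, never upscale."""
    if src_w <= target_w:
        return src_w, src_h
    scale = target_w / src_w
    return target_w, round(src_h * scale)

def plan(rule, src_w, src_h):
    """Turn a rule document into a simple transformation plan."""
    w, h = src_w, src_h
    for action in rule.get("actions", []):
        resize = action.get("resize")
        if resize and resize.get("fit", {}).get("type") == "clip":
            w, h = clip_resize(w, h, resize["width"])
    return {"format": rule["format"], "quality": rule["quality"],
            "width": w, "height": h}

print(plan(RULE, 1200, 800))
```

The actual pixel work would then be handed to an image library; the point here is only that the rule document fully determines the plan.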

This may sound familiar to you...

CDNs able to transform contents:

  • As a native functionality...
  • Or through lambdas

SaaS solutions:

  • imgix
  • libpixel
  • Cloudinary

Opensource solutions:

So why?

  • Sites were already doing that, so this saves them time
  • Close to the Schibsted sites
    • Not just latency (multi-region); also feature set, compliance...
  • Cost effective
    • SaaS are expensive at Schibsted scale
    • We can build & maintain what we really need
  • Adapting to other needs:
    • Document transformation
    • Video streaming

Why not offline transformations?

Lots of user-generated content, given the classifieds business. Sites are dynamic by nature, and some adapt the response to the device. Offline processing would block redesigns or improvements for lack of capacity to reprocess everything.

What are you most proud of?

High usage

Does not require high maintenance

(Almost) no incidents. We could maintain this with half an engineer

  • We don't like to cut people in half, so let's say one engineer

But be careful: if you stop developing a service, you kill a service

  • Stops being competitive
  • It quickly becomes legacy (old stack, old libs, old design)
  • Disconnect from current business needs

So we try to convince the company it requires, at least, the focus of two engineers.

Low costs

Low latency

How did you achieve that?

Combination of...

  • Team
  • Product management
  • Technical

Team

Autonomy

Benefiting from other Sch services

Big department portfolio:

  • AWS bootstrap
  • Vulnerability scans
  • TravisCI, Artifactory

Collaboration + transparency mindset

  • Internal RFCs
  • Consumers as contributors
  • Internal open-source model (full visibility of GitHub repos)

Product

Actual need

Limited scope

  • API as the point of interaction
  • No business logic: a "dumb" service
  • Almost no functionality that is used by a single site or by no one

Tech

Everything as code

No room for "one-time" actions.

  • Alerting configuration by code
  • Infrastructure updates
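A sketch of what "alerting configuration by code" can look like. The alert names, metrics, and thresholds below are illustrative, not YAMS's actual configuration: alerts live as data in the repo and are rendered to whatever the monitoring backend expects, so every change is reviewed and versioned like any other code.

```python
import json

# Illustrative alert definitions; names and thresholds are hypothetical.
ALERTS = [
    {"name": "high_5xx_rate", "metric": "http_5xx_ratio",
     "threshold": 0.01, "period_s": 300, "severity": "page"},
    {"name": "slow_p99", "metric": "latency_p99_ms",
     "threshold": 800, "period_s": 600, "severity": "ticket"},
]

def render(alerts):
    """Render alert definitions to a backend-agnostic JSON document
    that a deploy step would push to the monitoring system."""
    return json.dumps({"version": 1, "alerts": alerts}, indent=2)

print(render(ALERTS))
```

Because the definitions are data, a CI step can lint them (unique names, sane thresholds) before they ever reach the monitoring system.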

Good design choices

(but not perfect / or the best, for sure)

  • Immutable pattern
  • AWS + Netflix stack + Microservices
  • libvips

Continuous Delivery

And the capacity to incorporate everything into the pipeline.

> Roll forward, rather than investing lots of time in your rollback strategy

Small deltas. Iterative deliveries. Low risk deployments.

0-error target

Yeah, Google SRE book and error budgets...

... but it helped us understand and tune the service, earn the trust of Schibsted sites, avoid major disruptions when big sites onboarded, and minimize "unplanned / reactive" work

We also rely on a "good enough" test suite (unit + integration + acceptance) with good coverage of all API functionality

  • New error conditions mean new tests
  • If tests are green, almost (TM) no room for surprises
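In that spirit, each error condition the API can produce gets its own regression test. A toy sketch (the validator and error codes are hypothetical, not the real YAMS API):

```python
def validate_rule(rule):
    """Validate a transformation rule; return a list of error codes
    (an empty list means the rule is acceptable)."""
    errors = []
    if rule.get("format") not in {"webp", "jpeg", "png"}:
        errors.append("unsupported_format")
    quality = rule.get("quality", 90)
    if not 1 <= quality <= 100:
        errors.append("quality_out_of_range")
    for action in rule.get("actions", []):
        if "resize" in action and action["resize"].get("width", 1) <= 0:
            errors.append("invalid_width")
    return errors

# Each error condition discovered becomes a permanent test:
assert validate_rule({"format": "webp", "quality": 90}) == []
assert validate_rule({"format": "gif"}) == ["unsupported_format"]
assert validate_rule({"format": "png", "quality": 250}) == ["quality_out_of_range"]
```

Once an error condition has a test, a green suite really does leave little room for surprises on that path.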

Troubleshooting toolkit

And what did you do wrong?

The Refactor (TM)

Complete refactor: a new platform built in parallel to deliver a new version of the API

  • APIv0
  • APIv1

Microservices split

Domain-driven design... but coupling between some of the services

Nice solution... but

Why not docker/k8s?

  • Local tests
  • YAMS Portal/Frontend already there
  • Migration exercise

gRPC?

Why not a Service Mesh?

And Prometheus?

We may.

And it may be a good moment to consider OpenCensus.

Current (& not-so-far) future

More elasticity to reduce costs

  • Changes in transformation rules mean massive cache eviction
    • So we are a bit over-scaled...
  • Better degradation and more efficient ASG triggers
    • Reusing the cache when there is no capacity
    • Automatic ASG parameter adjustments
    • Minimizing parallelization in the transformation pipeline
    • An incoming queue
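The degradation idea above can be sketched as a simple decision function (the 90% threshold and the names are hypothetical): when the fleet has no spare capacity, prefer serving a cached variant over queueing a fresh transformation, and queue only as a last resort.

```python
def handle_request(in_flight, capacity, cached_variant):
    """Decide how to serve a request under load.

    in_flight / capacity approximates fleet utilization; above a
    hypothetical 90% threshold we degrade rather than transform.
    """
    utilization = in_flight / capacity
    if utilization < 0.9:
        return "transform"        # normal path: transform on the fly
    if cached_variant is not None:
        return "serve_cached"     # reuse the cache if no capacity
    return "enqueue"              # incoming queue as a last resort

assert handle_request(10, 100, None) == "transform"
assert handle_request(95, 100, "300x200.webp") == "serve_cached"
assert handle_request(95, 100, None) == "enqueue"
```

The same utilization signal could also feed the ASG trigger, so scaling and degradation stay consistent with each other.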

Extra compression

  • Currently libjpeg-turbo
  • Good performance, pretty decent results, but...
  • MozJPEG, API-compatible with libjpeg
  • guetzli, from Google

Bringing the service closer to the business

  • Image uploader
  • Online image editor
  • Integration with data services
    • Automatic classification
    • Nudity detector
    • Car plate pixelation
  • More regions/cloud providers deployments

  • Video transcoding...

Current transformation pipelines

More adoption?

Some major marketplaces are not using the service yet

Simulating dependencies failures

Hoverfly: similar in concept to Netflix's Simian Army, but specialized in API degradation

Stress test as part of the pipeline

Before closing...

Are you going to opensource it?

  • Schibsted does support contributing to open-source projects
  • As well as releasing internal code
  • Problem: we have not followed a "contribute-first" approach
  • But we have already contributed to bimg, Zuul, KrakenD...

Are you going to offer this SaaS to other companies?

Latencymap

api noiser

Corollary

Be Rx in the code...

But not in real life

Many thanks...

Sch*

And especially...

Edge colleagues

Other Qs?

dan . caba at google (dot)com

Your opinion is very important to me

  • Find my talk on the schedule in the Eventory app
  • Rate and comment on my performance

Thanks for your feedback; it tells me what to improve